library(dotenv)
library(tidyverse)
library(ggplot2)
library(sf)
library(tidycensus)
library(stringr)
library(magrittr)
library(jsonlite)
library(httr)
library(tidymodels)
library(knitr)
library(kableExtra)
library(data.table)
library(patchwork)
library(vip)
# Clone the private github repository for this assignment using SSH link:
# git@github.com:stejamilla/final-project.git
# do not use scientific notation
options(scipen = 999)Final Project
Data Science for Public Policy
About This Project
Urban environments tend to retain heat due to having less tree/vegetation cover, more impervious surfaces such as asphalt, and other factors. Using data on DC’s landcover, we built and tested three different models to predict where hotspots are most likely to occur within the District. We also included several variables from the American Community Survey (ACS) to test whether demographic characteristics could also be predictive of hotspots. Our analysis is at the block group level.
Setup
Loading packages:
Loading and Cleaning the Data
DC Land Cover Data
Loading data from Open Data DC (landcover, ward geometry):
# Loading Urban Tree Canopy by Census Block Group (2020) data
landcover <- read_sf(paste0("data/Urban_Tree_Canopy",
"_by_Census_Block_Group_in_2020.shp")) |>
rename(block_group = GEOID) |>
mutate(block_group = as.character(block_group)) |>
mutate(OBJECTID = as.character(OBJECTID)) |>
rename_all(tolower) |>
relocate(block_group, .after = objectid) |>
# Converting UHI to Fahrenheit
mutate(uhi = (uhi*(9/5) + 32))
# Loading DC Ward data from Urban Tree Canopy by Ward (2020)
wards <- read_sf("data/Urban_Tree_Canopy_by_Ward_in_2020.shp") |>
rename_all(tolower) |>
select(ward) |>
st_transform(crs = st_crs(landcover))Create key with Urban Tree Canopy data column descriptions and clean up:
Categorizing each block group into their respective ward:
# Spatial join so that each block group is assigned a ward based on majority
# area of a block group within a ward
landcover <- st_join(landcover, wards, largest = TRUE)
# Mapping the overlapping wards & block groups
ggplot() +
geom_sf(data = landcover,
aes(fill = ward))STEPHANIE 12/13 - this code says that if a ward intersects with a blockgroup, then classify that block group as a ward. For block groups that overlap multiple wards, it returns the ward that the block group overlaps with the most area (that’s the argument largest = TRUE)
Exploring Census Demographic Data
Load ACS 5-yr 2016-2020 variables:
MADELEINE PRE-12/12: * It occurs to me we may need to choose baseline categories for any categorical variables we include and leave those out? Not sure if that is only a requirement for line and linear regression? JAMIE 12/12 - Good point! I’m adding stuff in below without factoring that in for now, but I think we def need to determine if we need to deal with this! STEPHANIE 12/13 - I’m not too concerned about this because we’ve never had to address categorical variables in the assignments. We can do something like “step_dummy()” if we want to recode a categorical variable into a dummy instead.
Loading Census demographic data via API
# Secure Census API key (from .env file) for use in Census API call
load_dot_env()
credential <- Sys.getenv("census_api_key")
# Create url for API call
url <- str_glue(paste0("https://api.census.gov/data/2020/acs/acs5?get=",
# Description of data point
"NAME,",
# TOTAL POPULATION ESTIMATE
"B01003_001E,",
# TOTAL POPULATION BY RACE - WHITE ALONE
"B02001_002E,",
# TOTAL POPULATION BY RACE - BLACK OR AFR AMER ALONE
"B02001_003E,",
# TOTAL POPULATION BY RACE - AMER IND AND AK NAT ALONE
"B02001_004E,",
# TOTAL POPULATION BY RACE - ASIAN ALONE
"B02001_005E,",
# TOTAL POPULATION BY RACE - NAT HAW OR PAC ISL ALONE
"B02001_006E,",
# TOTAL POPULATION BY RACE - OTHER ALONE + TWO OR MORE RACES
"B02001_008E,",
# TOTAL HISPANIC OR LATINO
"B03003_003E,",
# TOTAL POP WITH DOCTORATE DEGREE
"B15003_025E,",
# TOTAL POP WITH PROFESSIONAL SCHOOL DEGREE
"B15003_024E,",
# TOTAL POP WITH MASTER'S DEGREE
"B15003_023E,",
# TOTAL POP WITH BACHELORS DEGREE
"B15003_022E,",
# TOTAL POP WITH ASSOCIATES DEGREE
"B15003_021E,",
# TOTAL POP WITH REGULAR HIGH SCHOOL DIPLOMA
"B15003_017E,",
# TOTAL POP WITH GED OR ALTERNATIVE CREDENTIAL
"B15003_018E,",
# MEDIAN HH INCOME IN PAST 12 MOS
"B19013_001E,",
# TOTAL HOUSING UNITS
"B25002_001E,",
# TOTAL OCCUPIED HOUSING UNITS
"B25002_002E,",
# TOTAL VACANT HOUSING UNITS
"B25002_003E,",
# MEDIAN PROPERTY VALUE
"B25077_001E,",
# GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
# TOTAL
"B25070_001E,",
# GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
# 30 TO 34.9 %
"B25070_007E,",
# GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
# 35 TO 39.9 %
"B25070_008E,",
# GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
# 40 TO 49.9 %
"B25070_009E,",
# GROSS RENT AS A % OF HH INCOME IN PAST 12 MO -
# 50 % or more
"B25070_010E",
# Limit to block-group-level data for DC
"&for=block%20group:*&for=tract:*&in=state:11&in=county:001"))
# Use url to request data from API
acs_json <- GET(url = url)
# Check call status
http_status(acs_json)$category
[1] "Success"
$reason
[1] "OK"
$message
[1] "Success: (200) OK"
# Save API JSON response as text
acs_json <- content(acs_json, as = "text")
# Save JSON as character matrix
acs_matrix <- fromJSON(acs_json)
# Convert matrix to tibble
acs_data <- as_tibble(acs_matrix[2:nrow(acs_matrix), ],
.name_repair = "minimal")
# Add variable names to tibble
names(acs_data) <- acs_matrix[1, ] |>
str_replace(" ", "_")Labeling and cleaning ACS variables:
# Clean up ACS data
acs_data <- acs_data |>
# Meaningful variable names
rename(total_pop = B01003_001E,
race_white = B02001_002E,
race_black = B02001_003E,
race_ai_an = B02001_004E,
race_asian = B02001_005E,
race_nh_pi = B02001_006E,
race_oth_mult = B02001_008E,
race_eth_hisp = B03003_003E,
ed_doct = B15003_025E,
ed_prof_deg = B15003_024E,
ed_master = B15003_023E,
ed_bach = B15003_022E,
ed_assoc = B15003_021E,
ed_ged = B15003_018E,
ed_reg_hsd = B15003_017E,
med_hhi = B19013_001E,
med_prop_val = B25077_001E,
rent_burden_tot = B25070_001E,
rent_burden_30_34 = B25070_007E,
rent_burden_35_39 = B25070_008E,
rent_burden_40_49 = B25070_009E,
rent_burden_50plus = B25070_010E,
total_hous_units = B25002_001E,
vac_units = B25002_003E,
occ_units = B25002_002E,
block_group_temp = block_group) |>
# Create block_group column to use for combining this with the other datasets
mutate(block_group = paste(state,
county,
tract,
block_group_temp,
sep = "")) |>
# Convert Census variables to numeric
mutate_at(vars(total_pop:rent_burden_50plus), as.numeric) |>
rename_all(tolower) |>
relocate(block_group, .before = name)
# Convert rent burden variables into a percentage of rent burdened households,
# then dropping original rent burden vars
acs_data <- acs_data |>
mutate(
rent_burden_sum =
rent_burden_30_34 +
rent_burden_35_39 +
rent_burden_40_49 +
rent_burden_50plus
) |>
mutate(
rent_burden_pct = (rent_burden_sum / rent_burden_tot)
) |>
select(-c(rent_burden_30_34,
rent_burden_35_39,
rent_burden_40_49,
rent_burden_50plus,
rent_burden_tot,
rent_burden_sum)
)Data Imputation Part 1
IMPUTATION The imputation logic works as follows: 1) If data is missing at the block group level, use the mean for that block group’s census tract.
- If data is still missing at the census tract level, use the mean for that census tract’s ward. (This happens further down in the code.)
This method is imperfect for two reasons: 1) Imputed values at the census tract level feed into imputation at the ward level. For instance, say a ward is made up of 4 census tracts and each tract has 4 block groups. Initially, only 2/4 tracts have complete data. After imputing at the tract level, 3/4 are complete. When we go to impute the data for the final tract, that calculation will include the values we imputed for the other missing tract. I don’t think this is that big of a deal, especially because the numbers of values that need to be imputed at the ward level is pretty small.
2) Median household income and median property value are top- and bottom-coded (no one has an income less than $2,499 or greater than $250,000). The only way around this I can think of would be to choose one or more cutoffs as I did with the rent burden variable (for example, categorize income into “high income” and “low income”). I didn’t do this because I wasn’t sure what those cutoffs would be. I’m also not really sure how much it matters, since either way we can’t capture the real high- and low-end incomes.
Imputation part 1: replace missing data points with tract-level data.
# Create tract dataframe
tract_data <- acs_data |>
mutate(med_hhi = if_else(
med_hhi < 0, NA, med_hhi),
med_prop_val = if_else(
med_prop_val < 0, NA, med_prop_val)
) |>
group_by(tract) |>
summarize(
rb_tract = mean(rent_burden_pct, na.rm = TRUE),
hhi_tract = mean(med_hhi, na.rm = TRUE),
propval_tract = mean(med_prop_val, na.rm = TRUE)
)
# Replace missing data with tract data:
acs_data <- acs_data |>
left_join(tract_data, by = "tract") |>
# Rent burden
mutate(
# Create indicator for imputed vars
rb_imputed = if_else(
is.na(rent_burden_pct), 1, 0),
# Tract-level imputation
rent_burden_pct = if_else(
is.na(rent_burden_pct), rb_tract, rent_burden_pct)
) |>
# Median HHI
mutate(
# Create indicator for imputed vars
hhi_imputed = if_else(
med_hhi < 0, 1, 0),
# Tract-level imputation
med_hhi = if_else(
med_hhi < 0, hhi_tract, med_hhi)
) |>
# Median property value
mutate(
# Create indicator for imputed vars
propval_imputed = if_else(
med_prop_val < 0, 1, 0),
# Tract-level imputation
med_prop_val = if_else(
med_prop_val < 0, propval_tract, med_prop_val)
)Creating the Final Dataset for Analysis
Deciding threshold for hotspot vs. not hotspot
Deciding the threshold for hot vs. not hot:
# Visualizing a cumulative histogram of counts of UHI values
landcover |>
group_by(uhi) |>
ggplot(aes(uhi)) +
geom_histogram(aes(y = cumsum(..count..)))# Calculating frequency & cumulative frequency
heat_freq <- landcover |>
count(uhi)
heat_freq$cumulative <- cumsum(heat_freq$n)
# Exploring other ways to generate a cutoff
heat_freq |> summarize(
sd(uhi),
mean(uhi)
)Simple feature collection with 1 feature and 2 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 389616.6 ymin: 124877.9 xmax: 407860.4 ymax: 147545.8
Projected CRS: NAD83 / Maryland
# A tibble: 1 × 3
`sd(uhi)` `mean(uhi)` geometry
<dbl> <dbl> <POLYGON [m]>
1 1.80 89.8 ((394387 136007.6, 394387.7 136010.7, 394388 136012.3, …
# The standard deviation of UHI is 1.80
# The mean is 89.8
# 1 SD above mean would be 91.6
heat_freq |>
filter(uhi >= 91.6) |>
count()Simple feature collection with 1 feature and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 395375 ymin: 134605.6 xmax: 401893.4 ymax: 143306.4
Projected CRS: NAD83 / Maryland
# A tibble: 1 × 2
n geometry
* <int> <MULTIPOLYGON [m]>
1 82 (((395651.3 136857, 395651.4 136859.2, 395657.1 136859.3, 395664.3 1368…
STEPHANIE 12/15 - I think doing one standard deviation makes sense. It didn’t adversely affect the decision tree model at least, and probably makes more sense than just choosing the mid point.
MADELEINE 12/14: I split up the join from the metrics because the ward-level imputation has to happen in between (after join, before metrics).
Data Imputation Part 2
Imputation part 2: for any data points still missing after imputing at tract level, impute at ward level.
# Create ward-level dataframe
ward_data <- dc_full_data |>
group_by(ward) |>
summarize(
rb_ward = mean(rent_burden_pct, na.rm = TRUE),
hhi_ward = mean(med_hhi, na.rm = TRUE),
propval_ward = mean(med_prop_val, na.rm = TRUE)
) |>
st_drop_geometry()
# Replace missing data with ward data:
# Rent burden
dc_full_data <- dc_full_data |>
left_join(ward_data, by = "ward") |>
mutate(
rent_burden_pct = if_else(
is.na(rb_tract), rb_ward, rent_burden_pct),
med_hhi = if_else(
is.na(hhi_tract), hhi_ward, med_hhi),
med_prop_val = if_else(
is.na(propval_tract), propval_ward, med_prop_val)
)
# Drop columns no longer needed
dc_full_data <- dc_full_data |>
select(-c(rb_tract, rb_ward,
hhi_tract, hhi_ward,
propval_tract, propval_ward))Finalizing full dataset
dc_full_data <- dc_full_data|>
mutate(
# convert total acres to sq miles
total_sq_mi = total_ac * 0.0015625,
# convert water acres to sq miles
water_sq_mi = wat_ac * 0.0015625,
# calculate water share of total sq miles
share_water_sq_mi = water_sq_mi / total_sq_mi,
# calculate population density per square mile
pop_dens_sq_mi = total_pop / total_sq_mi,
# calculate population density per water sq mile
pop_density_water_sq_mi = ifelse(
water_sq_mi !=0, total_pop / water_sq_mi, 0),
# calculate ratio of housing units to population
hous_units_per_person = ifelse(
total_hous_units != 0, total_hous_units / total_pop, 0),
# calculate ratio of vacant units to all units
vac_units_share = ifelse(
total_hous_units != 0, vac_units / total_hous_units, 0),
# calculate ratio of total income to median home value
inc_home_val_ratio = med_hhi / med_prop_val,
# create combined HS diploma or equivalent variable
ed_comb_hsd = ed_reg_hsd + ed_ged,
# create combined advanced post-undergraduate variable
ed_comb_adv_deg = ed_doct + ed_prof_deg + ed_master,
# calculate ratio of 25+ pop with HSD to total pop
ed_hsd_ratio = ifelse(
ed_comb_hsd !=0, ed_comb_hsd / total_pop, 0),
# calculate ratio of 25+ pop with bachelor's degree to total pop
ed_bach_ratio = ifelse(
ed_bach != 0, ed_bach / total_pop, 0),
# calculate ratio of 25+ pop with advanced post-bach degree to total pop
ed_adv_ratio = ifelse(
ed_comb_adv_deg != 0, ed_comb_adv_deg / total_pop, 0),
# calculate pop share white
race_white_ratio = ifelse(
race_white != 0, race_white / total_pop, 0),
# calculate pop share black or african american
race_black_ratio = ifelse(
race_black != 0, race_black / total_pop, 0),
# calculate pop share american indian or alaska native
race_ai_an_ratio = ifelse(
race_ai_an!= 0, race_ai_an / total_pop, 0),
# calculate pop share asian
race_asian_ratio = ifelse(
race_asian != 0, race_asian / total_pop, 0),
# calculate pop share native hawaiian or pacific islander
race_nh_pi_ratio = ifelse(
race_nh_pi != 0, race_nh_pi / total_pop, 0),
# calculate pop share two or more races
race_oth_mult_ratio = ifelse(
race_oth_mult != 0, race_oth_mult / total_pop, 0),
# calculate pop share hispanic or latino
race_eth_hisp_ratio = ifelse(
race_eth_hisp != 0, race_eth_hisp / total_pop, 0),
# calculate total non-white population
race_poc = race_black + race_ai_an + race_asian + race_nh_pi + race_oth_mult,
# calculate pop share non-white
race_poc_ratio = ifelse(
race_poc != 0, race_poc / total_pop, 0),
# creating hotspot var for decision tree model
hotspot = if_else(
uhi >= 91.6, "hotspot", "not hotspot"),
# remove factor from uhi var
uhi = as.numeric(as.character(uhi))) |>
relocate(c("block_group_temp", "tract"), .after = 2) |>
# remove columns with only one unique value
select(-objectid,
-name,
-state,
-statefp,
-county,
-countyfp,
-tractce,
-blkgrpce,
-namelsad,
-mtfcc,
-funcstat,
-field,
-shapearea,
-shapelen) |>
relocate(race_white_ratio:race_eth_hisp_ratio,
.after = race_eth_hisp) |>
relocate(ed_comb_hsd:ed_adv_ratio,
.after = ed_ged) |>
relocate(hous_units_per_person:inc_home_val_ratio,
.after = med_prop_val) |>
relocate(total_sq_mi:pop_density_water_sq_mi,
.after = inc_home_val_ratio) |>
relocate(total_pop:hotspot, .after = tract)Evaluating imputation via visualization
Missing data have been imputed using the next smallest geography available. In order to assess the accuracy of these imputations, we plot the variables with imputed values against another variable with no imputed values, one which should have some correlation with the partially imputed variable. We can then see if the imputed values follow the general trend of the correlation or not.
#### PERCENT RENT BURDENED ####
# It was harder finding variables that are clearly correlated with this than it
# was for some of the other partially-imputed variables. These are the most
# useful results I got.
# % rent burdened vs. POC ratio
dc_full_data |> ggplot() +
geom_point(mapping =
aes(x = ed_bach_ratio, y = rent_burden_pct,
color = as.character(rb_imputed))
) +
labs(color = "imputed (0 = no, 1 = yes)")# Not a super strong trend here, but all imputed data points seem to fit
# the general shape of the data.
# % rent burdened vs. ward
dc_full_data |> ggplot() +
geom_point(mapping =
aes(x = ward, y = rent_burden_pct,
color = as.character(rb_imputed))
) +
labs(color = "imputed (0 = no, 1 = yes)")# All the outliers here (rent_burden_pct = 1 or = 0) are not imputed values.
# Wouldn't want to rely on this alone since we impute using ward-level means
# in a few cases, but alongside the other charts, it provides further reassurance.
#### MEDIAN HOUSEHOLD INCOME ####
# Median household income vs. POC ratio
dc_full_data |> ggplot() +
geom_point(mapping =
aes(x = race_poc_ratio, y = med_hhi,
color = as.character(hhi_imputed))
) +
labs(color = "imputed (0 = no, 1 = yes)")# This looks pretty promising to me in terms of the imputed values being reasonable
# One or two semi-outliers, but imputed values generally follow trend of other data
# Don't think we need to worry too much about values where x = 0;
# these likely had no population data or actually no population
# Median household income vs. bachelor's degree ratio
dc_full_data |> ggplot() +
geom_point(mapping =
aes(x = ed_bach_ratio, y = med_hhi,
color = as.character(hhi_imputed))
) +
labs(color = "imputed (0 = no, 1 = yes)")# The outlier here is really just an outlier on in terms of ed_bach_ratio.
# Arguably the imputed med_hhi should have been higher, but not sure how much
# of an issue this is considering it's only one data point.
# Median household income vs. ward
dc_full_data |> ggplot() +
geom_point(mapping =
aes(x = ward, y = med_hhi,
color = as.character(hhi_imputed))
)# This looks good to me too. Really no imputed values are outliers in their ward.
#### MEDIAN PROPERTY VALUE ####
# Median property value vs. POC ratio
dc_full_data |> ggplot() +
geom_point(mapping =
aes(x = race_poc_ratio, y = med_prop_val,
color = as.character(propval_imputed))
) +
labs(color = "imputed (0 = no, 1 = yes)")# Imputed values seem to fit trend of data quite well.
# Median property value vs. bachelor's degree ratio
dc_full_data |> ggplot() +
geom_point(mapping =
aes(x = ed_bach_ratio, y = med_prop_val,
color = as.character(propval_imputed))
) +
labs(color = "imputed (0 = no, 1 = yes)")# Slightly weirdly shaped data, but the imputed points do not seem to be
# outside the trend. Only outlier is on the ed_bach_ratio var.
# Median property value vs. ward
dc_full_data |> ggplot() +
geom_point(mapping =
aes(x = ward, y = med_prop_val,
color = as.character(propval_imputed))
) +
labs(color = "imputed (0 = no, 1 = yes)")Prepare the data for analysis (set seed & splitting)
JAMIE 12/12 - Changed the var names so that these are the splits we use for all models, not just decision tree
- Do we need an implementation set too?
STEPHANIE 12/13 - Aaron mentioned that we should make sure that when we split the data, that we should stratify the data by some geographic feature so that for example not all the block groups in a certain ward are only in the training data. Because then our model would only really work for that ward. Thus, the argument strata = ward. * Concerning implementation data, I worry that since we only have a set of n = 571 that taking out a number would significantly affect our model. Maybe we can just do a set of 10 data points for the implementation data like we did in Assignment 7, but we’d have to randomize which ten are taken out (which we didn’t do in assignment 7, we sliced the data by where observations were located in the dataset)
Initial Exploratory Data Analysis
Review summary statistics
JAMIE 12/12 - Added summary stats table here
# Review summary statistics
kable(dc_train_summary <- as_tibble(dc_train |>
select(where(is.numeric)) |>
lapply(summary) |>
lapply(`length<-`, 6)) |>
mutate(measure = c("Min", "Q1", "Median", "Mean", "Q3", "Max")) |>
mutate(across(everything(), as.vector)) |>
pivot_longer(-measure) |>
pivot_wider(id_cols = "name",
names_from = "measure",
values_from = "value"))| name | Min | Q1 | Median | Mean | Q3 | Max |
|---|---|---|---|---|---|---|
| total_pop | 0.0000000 | 850.5000000 | 1139.0000000 | 1221.2087912 | 1555.5000000 | 3742.0000000 |
| race_white | 0.0000000 | 71.5000000 | 439.0000000 | 494.4307692 | 795.5000000 | 2335.0000000 |
| race_black | 0.0000000 | 87.0000000 | 348.0000000 | 557.2703297 | 871.5000000 | 3661.0000000 |
| race_ai_an | 0.0000000 | 0.0000000 | 0.0000000 | 4.4747253 | 0.0000000 | 508.0000000 |
| race_asian | 0.0000000 | 0.0000000 | 21.0000000 | 49.1758242 | 60.5000000 | 773.0000000 |
| race_nh_pi | 0.0000000 | 0.0000000 | 0.0000000 | 0.6109890 | 0.0000000 | 71.0000000 |
| race_oth_mult | 0.0000000 | 6.5000000 | 30.0000000 | 52.1736264 | 72.5000000 | 487.0000000 |
| race_eth_hisp | 0.0000000 | 22.0000000 | 78.0000000 | 141.8263736 | 176.5000000 | 1451.0000000 |
| race_white_ratio | 0.0000000 | 0.0682352 | 0.4009009 | 0.4222330 | 0.7418562 | 1.0000000 |
| race_black_ratio | 0.0000000 | 0.0875971 | 0.3272624 | 0.4171346 | 0.7453525 | 1.0000000 |
| race_ai_an_ratio | 0.0000000 | 0.0000000 | 0.0000000 | 0.0032669 | 0.0000000 | 0.3793876 |
| race_asian_ratio | 0.0000000 | 0.0000000 | 0.0184049 | 0.0401567 | 0.0522593 | 0.7630800 |
| race_nh_pi_ratio | 0.0000000 | 0.0000000 | 0.0000000 | 0.0005430 | 0.0000000 | 0.0817536 |
| race_oth_mult_ratio | 0.0000000 | 0.0048783 | 0.0263158 | 0.0427554 | 0.0570762 | 0.3453901 |
| race_eth_hisp_ratio | 0.0000000 | 0.0217248 | 0.0674157 | 0.1061547 | 0.1591406 | 0.6898551 |
| ed_doct | 0.0000000 | 0.0000000 | 19.0000000 | 35.4373626 | 54.5000000 | 291.0000000 |
| ed_prof_deg | 0.0000000 | 9.0000000 | 46.0000000 | 76.7472527 | 121.0000000 | 465.0000000 |
| ed_master | 0.0000000 | 65.5000000 | 156.0000000 | 183.4659341 | 277.0000000 | 840.0000000 |
| ed_bach | 0.0000000 | 100.0000000 | 189.0000000 | 218.6505495 | 308.5000000 | 798.0000000 |
| ed_assoc | 0.0000000 | 0.0000000 | 14.0000000 | 26.7780220 | 40.0000000 | 413.0000000 |
| ed_reg_hsd | 0.0000000 | 12.0000000 | 74.0000000 | 123.2065934 | 184.5000000 | 907.0000000 |
| ed_ged | 0.0000000 | 0.0000000 | 4.0000000 | 23.2065934 | 27.0000000 | 498.0000000 |
| ed_comb_hsd | 0.0000000 | 18.5000000 | 88.0000000 | 146.4131868 | 219.0000000 | 1133.0000000 |
| ed_comb_adv_deg | 0.0000000 | 92.5000000 | 260.0000000 | 295.6505495 | 461.0000000 | 1094.0000000 |
| ed_hsd_ratio | 0.0000000 | 0.0169049 | 0.0785219 | 0.1084304 | 0.1757625 | 0.5833980 |
| ed_bach_ratio | 0.0000000 | 0.0899638 | 0.1766304 | 0.1888757 | 0.2700129 | 1.0000000 |
| ed_adv_ratio | 0.0000000 | 0.0844247 | 0.2486931 | 0.2575276 | 0.4085868 | 1.0000000 |
| med_hhi | 2499.0000000 | 60766.5000000 | 96354.0000000 | 104099.6783699 | 135677.0000000 | 250001.0000000 |
| total_hous_units | 0.0000000 | 381.5000000 | 525.0000000 | 554.1890110 | 717.5000000 | 1560.0000000 |
| occ_units | 0.0000000 | 349.0000000 | 469.0000000 | 500.9538462 | 643.5000000 | 1482.0000000 |
| vac_units | 0.0000000 | 13.5000000 | 42.0000000 | 53.2351648 | 80.5000000 | 368.0000000 |
| med_prop_val | 146700.0000000 | 407616.6666667 | 597900.0000000 | 637116.7003945 | 788800.0000000 | 2000001.0000000 |
| hous_units_per_person | 0.0000000 | 0.3621221 | 0.4486621 | 0.4728340 | 0.5845168 | 1.3872340 |
| vac_units_share | 0.0000000 | 0.0255810 | 0.0763501 | 0.0919955 | 0.1296251 | 0.8938053 |
| inc_home_val_ratio | 0.0055447 | 0.1164494 | 0.1602061 | 0.1707757 | 0.2154070 | 0.4492911 |
| total_sq_mi | 0.0065000 | 0.0373203 | 0.0621406 | 0.1194089 | 0.1161562 | 4.4441719 |
| water_sq_mi | 0.0000000 | 0.0000000 | 0.0000000 | 0.0124242 | 0.0000000 | 1.9424688 |
| share_water_sq_mi | 0.0000000 | 0.0000000 | 0.0000000 | 0.0161898 | 0.0000000 | 0.6663825 |
| pop_dens_sq_mi | 0.0000000 | 8937.9885706 | 18233.3333333 | 23744.1005885 | 30555.1000276 | 167846.1538462 |
| pop_density_water_sq_mi | 0.0000000 | 0.0000000 | 0.0000000 | 934308.6588760 | 0.0000000 | 88576000.0000000 |
| rent_burden_pct | 0.0000000 | 0.2792525 | 0.4251969 | 0.4166570 | 0.5592122 | 1.0000000 |
| race_poc | 0.0000000 | 208.0000000 | 500.0000000 | 663.7054945 | 967.5000000 | 3680.0000000 |
| race_poc_ratio | 0.0000000 | 0.2168948 | 0.4611239 | 0.5038566 | 0.8234325 | 1.0000000 |
| aland | 16848.0000000 | 96644.5000000 | 159891.0000000 | 276436.3274725 | 294718.0000000 | 6514228.0000000 |
| awater | 0.0000000 | 0.0000000 | 0.0000000 | 32846.4923077 | 0.0000000 | 4996439.0000000 |
| gid | 3.0000000 | 154.5000000 | 293.0000000 | 289.4483516 | 427.5000000 | 571.0000000 |
| total_ac | 4.1600000 | 23.8850000 | 39.7700000 | 76.4216923 | 74.3400000 | 2844.2700000 |
| land_ac | 4.1600000 | 23.8850000 | 39.5100000 | 68.4702418 | 73.2150000 | 1601.0900000 |
| tcan_ac | 0.1200000 | 4.8400000 | 9.9100000 | 25.5966593 | 24.2900000 | 647.3400000 |
| tcan_pct | 0.7400000 | 17.7600000 | 25.6500000 | 28.4931209 | 36.0150000 | 87.9600000 |
| veg_ac | 0.0200000 | 1.5500000 | 4.6600000 | 11.5154066 | 12.3500000 | 541.3000000 |
| veg_pct | 0.3000000 | 5.9150000 | 11.0300000 | 12.1633407 | 17.2700000 | 43.9300000 |
| bld_ac | 0.8700000 | 5.9200000 | 8.3300000 | 10.8100000 | 11.9700000 | 147.0600000 |
| bld_pct | 2.2700000 | 14.4550000 | 20.0600000 | 22.3882637 | 30.7750000 | 58.9000000 |
| to_ia_ac | 3.1600000 | 15.2350000 | 22.5300000 | 31.2712967 | 35.1900000 | 608.0700000 |
| to_ia_pct | 6.2500000 | 42.6600000 | 57.8700000 | 57.6620440 | 74.3250000 | 97.0800000 |
| road_ac | 0.6800000 | 3.6900000 | 5.6700000 | 8.1374286 | 9.4200000 | 207.7900000 |
| road_pct | 2.0600000 | 10.5000000 | 14.3200000 | 14.4510110 | 17.8550000 | 37.6900000 |
| plot_ac | 0.0000000 | 0.5450000 | 1.4200000 | 3.2879121 | 3.0250000 | 109.0900000 |
| plot_pct | 0.0000000 | 1.6700000 | 3.2900000 | 4.7149451 | 6.5850000 | 31.6400000 |
| swalk_ac | 0.2100000 | 1.1250000 | 1.7600000 | 2.7635385 | 2.9100000 | 140.0800000 |
| swalk_pct | 0.1900000 | 3.0100000 | 4.3200000 | 5.1728791 | 6.4500000 | 18.7100000 |
| ot_ia_ac | 0.1200000 | 2.6100000 | 4.4700000 | 6.2724396 | 7.3800000 | 82.4500000 |
| ot_ia_pct | 1.0500000 | 6.8100000 | 10.5900000 | 10.9345275 | 14.1700000 | 36.6800000 |
| wat_ac | 0.0000000 | 0.0000000 | 0.0000000 | 7.9514945 | 0.0000000 | 1243.1800000 |
| wat_pct | 0.0000000 | 0.0000000 | 0.0000000 | 1.6189734 | 0.0000000 | 66.6383791 |
| soil_ac | 0.0000000 | 0.0000000 | 0.0000000 | 0.0868791 | 0.0100000 | 7.9100000 |
| soil_pct | 0.0000000 | 0.0000000 | 0.0000000 | 0.0626154 | 0.0300000 | 2.6000000 |
| utc_ac | 0.1200000 | 4.8400000 | 9.9100000 | 25.5966593 | 24.2900000 | 647.3400000 |
| utc_pct | 0.7400000 | 18.1100000 | 25.7700000 | 28.9163297 | 36.4900000 | 88.3300000 |
| ppa_v_ac | 0.0100000 | 1.3450000 | 4.1400000 | 10.2947253 | 11.7350000 | 262.4600000 |
| ppa_v_pct | 0.2100000 | 5.1750000 | 10.1600000 | 11.6471648 | 17.0750000 | 43.9000000 |
| to_ppa_ac | 0.5600000 | 3.9300000 | 8.1300000 | 16.2005275 | 17.8450000 | 404.5100000 |
| to_ppa_pct | 5.7100000 | 14.3600000 | 20.2400000 | 21.6425714 | 27.5100000 | 57.7600000 |
| ppa_ia_ac | 0.2700000 | 1.7900000 | 3.3000000 | 5.9056264 | 5.9950000 | 142.0500000 |
| ppa_ia_pct | 0.4300000 | 5.4900000 | 8.9000000 | 9.9955604 | 13.4500000 | 35.9400000 |
| un_v_ac | 0.0000000 | 0.0000000 | 0.0300000 | 1.1835385 | 0.4800000 | 277.3300000 |
| un_v_pct | 0.0000000 | 0.0000000 | 0.0900000 | 0.7950549 | 0.6650000 | 17.3200000 |
| to_un_ac | 1.9300000 | 12.8500000 | 19.4900000 | 26.6733626 | 30.1100000 | 751.2900000 |
| to_un_pct | 5.8400000 | 36.3800000 | 49.5500000 | 49.4410769 | 64.0750000 | 90.9600000 |
| un_ia_ac | 1.9200000 | 12.7650000 | 19.4500000 | 25.4088132 | 29.1600000 | 467.5400000 |
| un_ia_pct | 5.5900000 | 34.9250000 | 47.5100000 | 48.5798022 | 63.2750000 | 86.4700000 |
| un_sl_ac | 0.0000000 | 0.0000000 | 0.0000000 | 0.0808571 | 0.0100000 | 6.4300000 |
| un_sl_pct | 0.0000000 | 0.0000000 | 0.0000000 | 0.0659780 | 0.0300000 | 3.6300000 |
| wtr_ac | 0.0000000 | 0.0000000 | 0.0000000 | 7.9514945 | 0.0000000 | 1243.1800000 |
| wtr_pct | 0.0000000 | 0.0000000 | 0.0000000 | 2.7711429 | 0.0000000 | 199.7500000 |
| utc_ac_06 | 0.1200000 | 4.1550000 | 9.1100000 | 24.7237802 | 23.1050000 | 645.8900000 |
| utc_pct_06 | 0.3500583 | 15.4988558 | 23.5890767 | 27.0868853 | 34.7122435 | 87.4272365 |
| utcac0620 | -26.2600000 | -0.2100000 | 0.5900000 | 0.8728791 | 1.8050000 | 44.5900000 |
| utcpct0620 | -20.8511393 | -0.6573329 | 2.0519734 | 1.8294444 | 4.5201446 | 19.9808700 |
| utc_ac_11 | 0.1400000 | 4.7800000 | 9.3100000 | 25.3473626 | 22.8700000 | 648.8500000 |
| utc_pct_11 | 2.1336761 | 17.0570509 | 26.0317460 | 28.5631563 | 36.5599560 | 88.5914522 |
| utcac1120 | -44.1500000 | -0.8250000 | 0.0600000 | 0.2492967 | 1.1650000 | 50.5300000 |
| utcpct1120 | -21.2878206 | -2.0987933 | 0.1500000 | 0.3531734 | 2.6685408 | 14.4022283 |
| utc_ac_15 | 0.1300000 | 4.8000000 | 10.4100000 | 26.4955824 | 24.2500000 | 656.6800000 |
| utc_pct_15 | 2.4164524 | 16.5306141 | 26.2942431 | 29.5338941 | 37.5922255 | 89.9739583 |
| utcac1520 | -43.0600000 | -1.1300000 | -0.1900000 | -0.8989231 | 0.3400000 | 7.4900000 |
| utcpct1520 | -13.2604114 | -2.4741031 | -0.5030109 | -0.6175644 | 1.3392326 | 6.8620961 |
| enrgy_con | 0.0000000 | 0.8132393 | 2.6790648 | 4.7613427 | 7.2676798 | 35.1203240 |
| strm_wtr | 0.5867829 | 5.5386833 | 9.0813977 | 10.0328023 | 13.4333690 | 35.9528898 |
| uhi | 82.7737511 | 88.5480260 | 90.0977402 | 89.7619244 | 91.0952792 | 93.4873438 |
| land_own | 0.0000000 | 0.0011507 | 0.3909353 | 2.5219515 | 2.5708216 | 50.9486314 |
| wldlf_con | 0.0000000 | 0.0000000 | 0.0695619 | 4.0515448 | 7.2214827 | 20.8969048 |
| overall | 5.2997772 | 6.9158712 | 7.1840150 | 7.2340716 | 7.5190972 | 9.4915959 |
| low_utc | 11.6700000 | 63.5100000 | 74.2300000 | 71.0836703 | 81.8900000 | 99.2600000 |
| pos_utc | 5.7100000 | 14.3600000 | 20.2400000 | 21.6425714 | 27.5100000 | 57.7600000 |
| ward | 1.0000000 | 3.0000000 | 5.0000000 | 4.5516484 | 6.0000000 | 8.0000000 |
| geometry | 455.0000000 | 0.0000000 | 0.0000000 | NA | NA | NA |
Exploration of some of the continuous variables
JAMIE 12/12 - when we run the below viz we have a warning message that 56 rows were removed.
- I silenced the total ed attainment ggplot per by above note.
STEPHANIE 12/13 - concerning the rows being removed, I think that’s fine. I think that’s happened with me for other visualizations before. * Concerning the ed attainment, I’m getting a thing that “educ_total” doesn’t exist…so I’m not sure what’s happening here JAMIE 12/13 - I removed the var that went into educ_total bc it’s just the total line in the education table. So basically if we want to use a data viz here we need to use the categorical vars.
Exploring temperature in relation to other variables
JAME 12/12 - I fixed this - I guess we have to left_join with the sf df in the first position in left_join(), so that we join the non-sf df onto the sf df and it stays sf. BUT do we have to st_transform() since the crs is not 4326?
STEPHANIE 12/13 - It doesn’t matter that the CRS isn’t 4326 for mapping purposes - the main concern with CRS is that it has to be the same between two spatial datasets if you’re going to do something like join the two. I’m also realizing that we probably shouldn’t do a map here with dc_train because this map shows which block groups are in the training data (vs. the testing data) since you can see which block groups are excluded from the map…so this may be a form of data leakage. I’m going to instead use the dc_full_data dataset.
dc_full_data |> ggplot() +
geom_sf(
mapping = aes(fill = uhi),
color = "#400001"
) +
scale_fill_gradient(
low = "#ffffff",
high = "#be2b25"
) +
labs(
title = "Average evening temperature, 2018",
fill = "Temperature (degrees F)"
) +
theme_void()Temperature & population:
# Total population
dc_train |>
ggplot(aes(x = total_pop, y = uhi, color = hotspot)) +
geom_point(alpha = 0.5) +
theme_minimal()# Population density
dc_train |>
ggplot(aes(x = pop_dens_sq_mi, y = uhi, color = hotspot)) +
geom_point(alpha = 0.5) +
theme_minimal()# Non-white population density
dc_train |>
ggplot(aes(x = race_poc_ratio, y = uhi, color = hotspot)) +
geom_point(alpha = 0.5) +
theme_minimal()Temperature & income/education:
# Median household income
dc_train |>
ggplot(aes(x = med_hhi, y = uhi, color = hotspot)) +
geom_point(alpha = 0.5) +
theme_minimal()# Ratio of bachelors degrees
dc_train |>
ggplot(aes(x = ed_bach_ratio, y = uhi, color = hotspot)) +
geom_point(alpha = 0.5) +
theme_minimal()Temperature & landcover:
# Comparing with percentage of non-canopy vegetation
dc_train |>
ggplot(aes(x = veg_pct, y = uhi, color = hotspot)) +
geom_point(alpha = 0.5) +
theme_minimal()# Comparing with percentage area of urban tree canopy
dc_train |>
ggplot(aes(x = utc_pct, y = uhi, color = hotspot)) +
geom_point(alpha = 0.5) +
theme_minimal()# Comparing with percentage area of impervious surfaces (e.g., roads, parking
# lots, etc.)
dc_train |>
ggplot(aes(x = to_ia_pct, y = uhi, color = hotspot)) +
geom_point(alpha = 0.5) +
theme_minimal()STEPHANIE 12/13 - The graphs that compare UHI with percentage area of urban tree canopy and of impervious surfaces seems quite compelling. This makes sense, and it’s interesting to notice how not a lot of the demographic stuff doesn’t seem to show up in EDA but the landcover stuff seem like promising predictors.
Preparation for Modeling
STEPHANIE 12/13 - I added “-tract” and “-ward” to remove more geographic indicators that likely would affect the model, since that otherwise would’ve been a main predictor of a hotspot
JAMIE 12/12 - Do we want folds? added here. Not using folds does not make the lm rank deficient model warning go away. STEPHANIE 12/13 - I think yes, we’ll have to use it, especially for hyper parameter tuning. My main issue with folding at this point is that I think we need to remove all the variables we don’t want from dc_train first and then get do folding so that the folding process doesn’t take them into account. So I think we should take out the variables from dc_train before the lm and rf models (using select rather than step_rm) and then do the folding. JAMIE 12/13 - Relocated these steps per discussion.
MADELEINE 12/14 - should hotspot be kept in here? there is a comment saying “remove duplicative hotspot”, but then the next code chunk seems to need it
STEPHANIE 12/15 - that’s a good point, Madeleine; we’ll have to remove it for lm and rf but need to keep it in for decision tree
# Prep the dataset for modeling
dc_train <- dc_train |>
st_drop_geometry() |>
select(
# remove identifying vars
-block_group,
-intptlat,
-intptlon,
-gis_id,
-globalid, gid,
-tract,
-ward,
-block_group_temp,
# remove raw total pop var (custom pop density and other pop proportion
# vars remain)
-total_pop,
# remove raw race vars (custom proportion vars remain)
-race_white,
-race_black,
-race_ai_an,
-race_asian,
-race_nh_pi,
-race_oth_mult,
-race_eth_hisp,
# remove raw ed vars (custom proportion vars remain)
-ed_doct,
-ed_prof_deg,
-ed_master,
-ed_bach,
-ed_assoc,
-ed_reg_hsd,
-ed_ged,
-ed_comb_hsd,
-ed_comb_adv_deg,
# remove duplicative housing data (custom proportion vars and pct var
# remain)
-total_hous_units,
-occ_units,
-vac_units,
# remove duplicative land data (custom proportion / "pct" vars remain)
-ends_with("_ac"),
-total_sq_mi,
-water_sq_mi,
-wtr_pct,
-aland,
-awater,
-utcac0620,
-utcac1120,
-utcac1520,
-utcpct0620,
-utcpct1120,
-utc_pct_06,
-utc_pct_11,
-utc_pct_15,
# remove pop_density_water_sq_mi due to INF value for most records
-pop_density_water_sq_mi)
# V-fold cross validation with 10 folds
dc_folds <- vfold_cv(data = dc_train, v = 10)Decision Tree
Modeling
dt_recipe <- recipe(formula = hotspot ~ ., data = dc_train) |>
# remove uhi var to avoid duplication of data from hotspot
step_rm(uhi) |>
themis::step_downsample(hotspot) |>
step_nzv(all_numeric_predictors())
dt_model <- decision_tree() |>
set_engine(engine = "rpart") |>
set_mode(mode = "classification")
dt_workflow <- workflow() |>
add_recipe(dt_recipe) |>
add_model(dt_model)
dt_fitting <- dt_workflow |>
fit(data = dc_train)
rpart.plot::rpart.plot(x = dt_fitting$fit$fit$fit,
roundint = FALSE)STEPHANIE 12/15 - Retaining comments for write-up in case we want to utilize them:
Note on results: I think we have an interesting model. The first decision point is whether the percentage of the block group that’s covered by tree canopy; if it’s greater or equal to 23% then it’s not a hotspot. If it is less than 23%, then you check how suitable the block group is for tree planting “based on a weighted formula that includes all planting prioritization categories.” If the score is less than 7, then it’s a not hotspot. If it’s greater than or equal to seven, you check the percent of total area that’s covered by all combined impervious surfaces. If the percentage is less than 66%, then it’s not a hotspot. If it’s greater than or equal to 66%, then the final check is whether the number of housing units per person is less than 0.74. If it’s greater than or equal to 0.75, then it’s not a hotspot. If it is less than 0.74 then it is a hotspot.
Note on methods: STEPHANIE 12/13 - I’ve decided to keep in downsampling because I think it’s useful. For reference, down-sampling “randomly subset all the classes in the training set so that their class frequencies match the least prevalent class. For example, suppose that 80% of the training set samples are the first class and the remaining 20% are in the second class. Down-sampling would randomly sample the first class to be the same size as the second class (so that only 40% of the total training set is used to fit the model).”
Visualizing Important Variables
STEPHANIE 12/13 - Using dc_full_data as the dataset since it’s possible we may accidentally do data leakage when mapping dc_train because we’ll notice which block groups were excluded.
canopy_map <- dc_full_data |> ggplot() +
geom_sf(aes(fill = utc_pct)) +
scale_fill_gradient(
low = "#fcbc32",
high = "#4e974b"
) +
theme_void() +
labs(
title = "Percentage of tree canopy",
subtitle = "by Block Group",
fill = "Percent Area",
caption = "Open Data DC (2020)"
)
planting_map <- dc_full_data |> ggplot() +
geom_sf(aes(fill = overall)) +
scale_fill_gradient(
low = "#e2e2e2",
high = "#16501a"
) +
theme_void() +
labs(
title = "Tree planting suitability score",
subtitle = "by Block Group",
fill = "Score",
caption = "Open Data DC (2020)"
)
impervious_map <- dc_full_data |> ggplot() +
geom_sf(aes(fill = to_ia_pct)) +
scale_fill_gradient(
low = "#ffffff",
high = "#28292b"
) +
theme_void() +
labs(
title = "Percentage covered by impervious surfaces",
subtitle = "by Block Group",
fill = "Percent Area",
caption = "Open Data DC (2020)"
)
housing_map <- dc_full_data |> ggplot() +
geom_sf(aes(fill = hous_units_per_person)) +
scale_fill_gradient(
low = "#d86400",
high = "#0085d4"
) +
theme_void() +
labs(
title = "Housing units per person",
subtitle = "by Block Group",
fill = "# Housing Units",
caption = "ACS 2016-2020"
)
canopy_map + planting_map + impervious_map + housing_mapEvaluating the Model
STEPHANIE 12/15 - Adding this evaluation in since we need to assess the decision tree model on its own (can’t really compare it to rf and lm since they have different outcome variables).
Creating predictions
# Ensuring the testing dataset has the same variables as the training dataset
dc_test <- dc_test |>
st_drop_geometry() |>
select(
# remove identifying vars
-block_group,
-intptlat,
-intptlon,
-gis_id,
-globalid, gid,
-tract,
-ward,
-block_group_temp,
# remove raw total pop var (custom pop density and other pop proportion
# vars remain)
-total_pop,
# remove raw race vars (custom proportion vars remain)
-race_white,
-race_black,
-race_ai_an,
-race_asian,
-race_nh_pi,
-race_oth_mult,
-race_eth_hisp,
# remove raw ed vars (custom proportion vars remain)
-ed_doct,
-ed_prof_deg,
-ed_master,
-ed_bach,
-ed_assoc,
-ed_reg_hsd,
-ed_ged,
-ed_comb_hsd,
-ed_comb_adv_deg,
# remove duplicative housing data (custom proportion vars and pct var
# remain)
-total_hous_units,
-occ_units,
-vac_units,
# remove duplicative land data (custom proportion / "pct" vars remain)
-ends_with("_ac"),
-total_sq_mi,
-water_sq_mi,
-wtr_pct,
-aland,
-awater,
-utcac0620,
-utcac1120,
-utcac1520,
-utcpct0620,
-utcpct1120,
-utc_pct_06,
-utc_pct_11,
-utc_pct_15,
# remove pop_density_water_sq_mi due to INF value for most records
-pop_density_water_sq_mi)
# Creating predictions dataset
dt_predictions <- bind_cols(
dc_test,
predict(object = dt_fitting, new_data = dc_test),
predict(object = dt_fitting, new_data = dc_test, type = "prob")
)
# Editing column names so it's readable
names(dt_predictions) <- names(dt_predictions) |>
str_replace(" ", "_")
# Printing a few predictions
dt_predictions |>
select(hotspot, .pred_class, .pred_hotspot, .pred_not_hotspot) |>
print(n = 20)# A tibble: 116 × 4
hotspot .pred_class .pred_hotspot .pred_not_hotspot
<chr> <fct> <dbl> <dbl>
1 not hotspot not hotspot 0.116 0.884
2 not hotspot not hotspot 0.116 0.884
3 not hotspot not hotspot 0.116 0.884
4 not hotspot not hotspot 0.116 0.884
5 not hotspot not hotspot 0.375 0.625
6 not hotspot not hotspot 0.116 0.884
7 not hotspot not hotspot 0.375 0.625
8 not hotspot not hotspot 0.116 0.884
9 not hotspot not hotspot 0.116 0.884
10 not hotspot not hotspot 0.116 0.884
11 not hotspot not hotspot 0.116 0.884
12 not hotspot not hotspot 0.116 0.884
13 not hotspot not hotspot 0.116 0.884
14 not hotspot not hotspot 0.263 0.737
15 not hotspot not hotspot 0.116 0.884
16 not hotspot not hotspot 0.116 0.884
17 not hotspot not hotspot 0.116 0.884
18 not hotspot not hotspot 0.116 0.884
19 not hotspot not hotspot 0.116 0.884
20 not hotspot not hotspot 0.116 0.884
# ℹ 96 more rows
Creating confusion matrix
# Transforming the hotspot variable to become a factor variable
dt_predictions$hotspot <- as.factor(dt_predictions$hotspot)
# Creating confusion matrix
conf_mat(data = dt_predictions,
truth = hotspot,
estimate = .pred_class) Truth
Prediction hotspot not hotspot
hotspot 16 13
not hotspot 1 86
Calculating accuracy, precision, and recall/sensitivity:
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.879
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 precision binary 0.552
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 recall binary 0.941
Linear Regression
JAMIE 12/12 - Added recipe for lm and rf and added lm model and fit below. Having trouble with this major warning: A | warning: prediction from rank-deficient fit; consider predict(., rankdeficient=“NA”) Also, I experimented here with removing all the columns that seemed duplicative. I did not add this to the dt recipe for now. Let me know what you all think of removing these.
STEPHANIE 12/13 - The main issue are the columns that have NA values. We need to decide if we should remove those or if it’s possible for us to do data imputation aka decide what those values should be. I removed the identifying factor variables becuase I had to remove them earlier from dc_train.
# Create recipe for use in linear regression and random forest models
# Create the recipe
dc_recipe <- recipe(formula = uhi ~ ., data = dc_train) |>
# remove duplicative hotspot
step_rm(hotspot) |>
# Removing predictors that have near zero variability
step_nzv(all_predictors()) |>
# Removing predictors that are highly correlated with others
step_corr(all_predictors())Modelling
# Linear regression model
dc_lm_model <-
linear_reg() |>
set_mode(mode = "regression") |>
set_engine(engine = "lm")
# Linear regression workflow
dc_lm_workflow <-
workflow() |>
add_recipe(recipe = dc_recipe) |>
add_model(spec = dc_lm_model)
unloadNamespace("Metrics")
dc_lm_resamples <- dc_lm_workflow |>
fit_resamples(resamples = dc_folds)STEPHANIE 12/15 - I took out fitting the model because we only really have to fit the best model between linear regression and random forests, since the purpose of fitting the model is to then create predictions from it, and we’d only want to create predictions based on the best model.
Random Forests
Modelling
Hyperparameter Tuning
dc_rf_model <-
rand_forest(
trees = 200,
mtry = tune(),
min_n = tune()) |>
set_mode(mode = "regression") |>
set_engine(
engine = "ranger",
importance = "impurity",
num.threads = 4
)
# Create a parameter grid
dc_rf_grid <- grid_regular(mtry(range = c(1, 15)),
min_n(range = c(1, 15)),
levels = 5)
# Random forest workflow
dc_rf_workflow <-
workflow() |>
add_recipe(dc_recipe) |>
add_model(dc_rf_model)
# Creating a grid to show results of hyper parameter tuning
dc_rf_grid <- grid_regular(
mtry(range = c(1, 15)),
min_n(range = c(1, 15)),
levels = 5
)
# Tuning the grid
rf_resamples <- tune_grid(
dc_rf_workflow,
resamples = dc_folds,
grid = dc_rf_grid
)
# Showing the best hyperparameters
show_best(rf_resamples)# A tibble: 5 × 8
mtry min_n .metric .estimator mean n std_err .config
<int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 15 8 rmse standard 0.890 10 0.0338 Preprocessor1_Model15
2 15 1 rmse standard 0.891 10 0.0317 Preprocessor1_Model05
3 8 4 rmse standard 0.893 10 0.0324 Preprocessor1_Model08
4 11 1 rmse standard 0.893 10 0.0329 Preprocessor1_Model04
5 8 1 rmse standard 0.893 10 0.0315 Preprocessor1_Model03
STEPHANIE 12/15 - I took out the kable() argument from the line of code when we make dc_rf_grid becuase we don’t really need to see the grid at that point; what matters is the last line of code when you show which hyperparameters are the best ones to use. It looks like mtry = 15 and min_n = 4 are the best options.
Implementing best hyperparameters based on tuning
# Filling in hyperparameters based on tuning
dc_rf_model <- rand_forest(
trees = 200,
mtry = 15,
min_n = 4
) |>
set_mode(mode = "regression") |>
set_engine(
engine = "ranger",
importance = "impurity",
num.threads = 4
)
# Random forest workflow
dc_rf_workflow <-
workflow() |>
add_recipe(dc_recipe) |>
add_model(dc_rf_model)
rf_resamples <- dc_rf_workflow |>
fit_resamples(resamples = dc_folds)Comparing Models
Plot the RMSEs across each fit:
JAMIE 12/12 - Below is what I used in the assignment 07 hw to pull the metrics in the end, so pasted here and updated a bit. It renders!
STEPHANIE 12/15 - I edited the code to be a bit more simple (also based on what I submitted for assignment 07)
# Linear Regression: Calculating and plotting the RMSE for each resample
collect_metrics(
dc_lm_resamples,
summarize = FALSE) |>
filter(.metric == "rmse") |>
print() |>
ggplot(mapping = aes(x = id, y = .estimate, group = .metric)) +
geom_point() +
geom_line()# A tibble: 10 × 5
id .metric .estimator .estimate .config
<chr> <chr> <chr> <dbl> <chr>
1 Fold01 rmse standard 0.000411 Preprocessor1_Model1
2 Fold02 rmse standard 0.000590 Preprocessor1_Model1
3 Fold03 rmse standard 0.216 Preprocessor1_Model1
4 Fold04 rmse standard 0.000650 Preprocessor1_Model1
5 Fold05 rmse standard 0.000578 Preprocessor1_Model1
6 Fold06 rmse standard 0.000634 Preprocessor1_Model1
7 Fold07 rmse standard 0.000628 Preprocessor1_Model1
8 Fold08 rmse standard 0.174 Preprocessor1_Model1
9 Fold09 rmse standard 0.000629 Preprocessor1_Model1
10 Fold10 rmse standard 0.000708 Preprocessor1_Model1
# Random Forests: Calculating and plotting the RMSE for each resample
rf_resamples |>
collect_metrics(summarize = FALSE) |>
filter(.metric == "rmse") |>
print() |>
ggplot(mapping = aes(x = id, y = .estimate, group = .metric)) +
geom_point() +
geom_line()# A tibble: 10 × 5
id .metric .estimator .estimate .config
<chr> <chr> <chr> <dbl> <chr>
1 Fold01 rmse standard 0.786 Preprocessor1_Model1
2 Fold02 rmse standard 0.942 Preprocessor1_Model1
3 Fold03 rmse standard 0.795 Preprocessor1_Model1
4 Fold04 rmse standard 0.884 Preprocessor1_Model1
5 Fold05 rmse standard 0.811 Preprocessor1_Model1
6 Fold06 rmse standard 0.755 Preprocessor1_Model1
7 Fold07 rmse standard 0.999 Preprocessor1_Model1
8 Fold08 rmse standard 0.855 Preprocessor1_Model1
9 Fold09 rmse standard 1.01 Preprocessor1_Model1
10 Fold10 rmse standard 1.00 Preprocessor1_Model1
# Comparing average rmse
bind_rows(
"Linear regression" = collect_metrics(dc_lm_resamples) |>
filter(.metric == "rmse"),
'Random forest' = collect_metrics(rf_resamples) |>
filter(.metric == "rmse"),
.id = "model"
)# A tibble: 2 × 7
model .metric .estimator mean n std_err .config
<chr> <chr> <chr> <dbl> <int> <dbl> <chr>
1 Linear regression rmse standard 0.0394 10 0.0261 Preprocessor1_Model1
2 Random forest rmse standard 0.885 10 0.0314 Preprocessor1_Model1
STEPHANIE 12/15 - Based on this, it seems that linear regression is the better model over random forests by a fairly large margin, which I find interesting since I think usually random forest models out perform other models like lm and regression trees?
Estimate the out-of-sample error rate
JAMIE 12/13 - Added metrics work below STEPHANIE 12/15 - Sorry for my ignorance, but I’m not entirely sure what we’re doing here - is it finding another metric to compare models? Based on the RMSE values above, it seems we should go with the regression model and then create predictions with that and then visualize the best predictors. MADELEINE 12/15 - also, this chunk does not run because dc_lm_fit is not defined. Setting it to not evaluate for now
library(Metrics)
# save best lm fit
best_lm_fit <- fit_best(dc_lm_fit, verbose = TRUE)
# make predictions using best fit
test_predictions <-
dc_test |>
select(uhi) |>
rename(y = uhi) |>
mutate(y = as.numeric(y)) |>
mutate(y_hat = as.numeric(as_vector(predict(best_lm_fit, dc_test))))
# collect rf metrics
kable(rmse(actual = test_predictions$y, predicted = test_predictions$y_hat))Implement the Final Model
STEPHANIE 12/13 - Setting up the final code block for when we decide which model we should make predictions from. This probably won’t render because the random forest model isn’t working at the time I’m writing this (7pm) MADELEINE 12/15 - not sure if more needs to be done here, but when I just replace the argument of extract_workflow with dc_lm_model, I get two errors. 1) select(-hotspot): hotspot does not exist. 2) no applicable method for ‘extract_workflow’ applied to an object of class “c(‘linear_reg’, ‘model_spec’)” Setting to not evaluate for now
MADELEINE 12/15: setting this to not evaluate for now since it’s blocking rendering ## Visualizing the most important predictors
STEPHANIE 12/13 - Then with the code chunk below, we should make some maps with the top ~3 variables as determined by the code chunk above
Sources
https://www.climatecentral.org/climate-matters/urban-heat-islands-2024
https://opendata.dc.gov/datasets/DCGIS::urban-tree-canopy-by-census-block-group-in-2020/about
Census ACS 5yr 2016 - 2020
https://www2.census.gov/programs-surveys/acs/tech_docs/accuracy/MultiyearACSAccuracyofData2020.pdf
http://bit.ly/3DnlfIh